A comprehensive in silico analysis of the deleterious nonsynonymous SNPs of human FOXP2 protein

FOXP2 encodes the forkhead transcription factor that plays a significant role in language development. Single nucleotide polymorphisms in FOXP2 have been linked to speech- language disorder, autism, cancer and schizophrenia. So, scrutinizing the functional SNPs to better understand their association in disease is an uphill task. The purpose of the current study was to identify the missense SNPs which have detrimental structural and functional effects on the FOXP2 protein. Multiple computational tools were employed to investigate the deleterious role of non-synonymous SNPs. Five variants as Y531H, L558P, R536G and R553C were found to be associated with diseases and located at the forkhead domain of the FOXP2 protein. Molecular docking analysis of FOXP2 DNA binding domain with its most common target sequence 5’-CAAATT-3’ predicted that R553C and L558P mutant variants destabilize protein structure by changing protein-DNA interface interactions and disruption of hydrogen bonds that may reduce the specificity and affinity of the binding. Further experimental investigations may need to verify whether this kind of structural and functional variations dysregulate protein activities and induce formation of disease.


Introduction
One of the members of the winged-helix (FOX) family of transcription factors is FOXP2 that plays a crucial role in language processing and cognition [1]. During fetal development, the protein is expressed highly in the brain and several other parts of the body like the gut and lung [2]. The protein is characterized by a DNA-binding domain of forkhead box, large polyglutamine sequence, a conserved leucine zipper and zinc finger, acidic C-terminus and a dimerization domain [3]. The protein act as a transcription factor which is evolutionarily very conserved and may bind to near about 300-400 gene promoters in the human genome. Being a transcription element, FOXP2 can control a number of genes named DISC1, CNTNAP2 and SRPX2/μPAR, that are likely to be considered as important players for speech and language development [4]. An experiment on mice suggested the role of FOXP2 in the process of neuron formation, differentiation and migration [5]. The neural correlation of FOXP2 was shown in patients with verbal dyspraxia caused by disruption of the FOX domain [6]. All sorts of mutations have been described in FOXP2 gene. For example, missense mutations [7], non-sense mutations [8], deletions [9,10] and indels [11] have been discovered in different cases and families [12]. Single nucleotide polymorphism in the enciphering area of the FOXP2 gene leads to problems like speech-language disorder 1 (SPCH1) and developmental apraxia of speech (DAS) with impaired language communication and comprehension [13]. Polymorphisms of the gene, located at the chromosome 7q31, have also been related with frontotemporal lobar degeneration [14], schizophrenia [15], autism disorder [16] and development and progression of various types of cancer such as ovarian [17], breast [18] and prostate [19] cancer.
Single nucleotide polymorphisms (SNPs) in the human genome are most available genetic variations that are single base pair changes found in every 200-300 base pairs and act as genetic markers [20]. Approximately 0.5 million SNPs, that reside in the open reading frames of the human genome [21] have drawn much interest because they produce a large number of amino acid variations that lead to functionally diverse protein variants many of which eventually lead to diseases [22]. Several previous investigations have shown that more than fifty percent of the variations associated with hereditary genetic disorders are caused by substitution of amino acids known as non-synonymous SNPs (nsSNPs) [23]. These functional variations can exert deleterious or neutral effects on protein structure or function [24]. Detrimental effects of nsSNPs might cause destabilizing protein structure, alteration of gene regulation, changing ligand-binding site [25], changing protein hydrophobicity, geometry, charge [26], modify dynamics, stability, protein-protein interactions and alter translation, resulting to threatening the cellular structural integrity [27]. It has also been reported that the involvement of multiple nsSNPs stimulate the possibility of infections, autoimmune diseases and the expansion of inflammatory disorders [20].
Though over the past couple of years, various computational methods and techniques have been applied for sorting and detecting the effects of disease-associated SNPs in other subfamilies of the target protein, limited investigations have been performed on FOXP2 protein. Considering the significant role of the protein, the present research work has been designed to identify the deleterious nsSNPs and assess their pathogenic effects on the protein using various in silico algorithms. Since FOXP2 regulates the transcription of various target genes; molecular docking was performed to understand the effects of the variants on target DNA binding that may lead to altered gene regulation.

Retrieval of missense SNPs
The list of all missense SNPs of FOXP2 gene, relevant information (reference SNP ID, position, changed amino acid residue and protein accession number) of the gene and corresponding protein were retrieved from NCBI dbSNP (https://www.ncbi.nlm.nih.gov/projects/SNP/) and UniProt (https://www.uniprot.org/) databases [28]. Only the nsSNPs were considered for further exploration.

Identifying the most detrimental nsSNPs
The substitution of amino acid at a particular position often greatly affects the function of the protein. Five various computational tools-PROVEAN [29], PolyPhen-2 [30], SNPnexus [31], SNAP2 [32] and PON-P2 were employed to anticipate the effect of nsSNPs on FOXP2 function [33]. Based on position-specific independent count (PSIC) scores, PolyPhen-2 estimates the impact of mutation on protein at both structural and functional levels. The difference between the scores clarifies the nsSNPs into benign, possibly damaging and probably damaging [34]. An open accessible web server PROVEAN considers the list of mutant variants and generates a score after homology searching. The threshold value �-2.5 labelled the SNPs into neutral or deleterious [29]. SNPnexus is comprised of with the combination of SIFT and Poly-Phen tools [31]. SIFT defines the variants as deleterious when the tolerance index score is �0.05. Accessible properties of the wild and mutant protein were compared by the neural network based in silico tool SNAP2. The tool generates a heat map and differentiates the variants as effect or neutral providing a confidence score [32]. PON-P2, a machine learning-based server, computes random forest probability score to categorize the harmful variants into unknown, neutral or pathogenic [33].

Prediction of disease related SNPs
Both SNPs&GO (DOI: 10.1002/humu.21047) and PhD-SNP (DOI: 10.1093/bioinformatics/ btl423) online tools were applied to predict whether SNPs are disease associated or not. Support vector machine-based method SNPs&GO predicts the relationship of SNPs with disease at 81% accuracy. A probability score �0.5 points out that SNPs are related with clinical complications [35]. An online tool, PhD-SNP was also used to assess the connection of SNPs with diseases and classify them into neutral or disease associated on a 0-9 grade reliability index score [36].

Alteration of structural and functional characteristics
MutPred 2 (http://mutpred.mutdb.org/) was used to predict the impact of disease associated SNPs based on the changes in 14 different biophysical properties. The tool evaluated the possibilities of addition or loss of some of the features providing a p-values. The p-values <0.05 and <0.01 were denote as significant and very significant output respectively [37].

Protein stability prediction
A pair of software was used to check whether alterations of amino acid affect the protein stability. I-Mutant 2.0 (http://folding.biofold.org/i-mutant/i-mutant2.0.html) utilizes support vector machine to analyze any change in the stable state of the protein. The tool was derived from ProTherm which is a broad dataset of experimental data on protein mutations [38]. Conditions for all selected inputs were fixed at temperature 25 0 C and pH 7.0. The outcomes provide a free energy change (DDG) and unfolding free energy value of the mutated and wild proteins [36]. MUpro (https://www.ics.uci.edu/~baldig/mutation.html) identifies the impact of singlesite mutations in protein stability. A confidence score <0 dictates the reduction in stability of the protein while the score >0 interprets the opposite effect [39].

Evolutionary conservation analysis
ConSurf (https://consurf.tau.ac.il/) is an effective tool that addresses the evolutionary pattern of conservation at each amino acid site [40]. By using the Bayesian calculation method, the server analyzes close sequence homologues and estimates the phylogenetic relationship [40]. For each residue of the candidate protein, the degree of conservation is estimated on a scale of 1 to 9 and classifies these as variable (1-4), intermediate (5)(6) and conserved (7-9) [41].
induce allosteric or orthosteric effects that lead to shifts in energy conformations and stabilization, a deep understanding of the structure encompassing PTM sites helps to clarify the impact of PTMs on protein folding [43]. Phosphorylation, methylation, ubiquitination, acetylation, Nlinked glycosylation and palmitoylation are some of the remarkable PTM that play crucial role in the study of diseases [44]. ModPred (http://www.modpred.org/), a sequence-based predictor, was applied to predict the overall influence of highly risky nsSNPs on the PTM of the target protein. This database evaluates the tendency of a specific amino acid to be modified [43].

Molecular docking between FOXP2 variants and the consensus DNA motif
RaptorX, a deep-learning based structure prediction tool was applied to build the three dimensional structures of the wild type as well as the variants (L558P, R536G, R553C and Y531H) of FOXP2 DNA binding domain (DBD) [45]. Quality of the structures were assessed using ProSA-web (PMID: 17517781) and Ramachandran plot through PDBsum (PMID: 9433130) at the EMBL-EBI site. 3D structure (in Protein Data Bank or pdb format) of the most common target site of FOXP2 (5'-CAAATT-3') in B-form was generated using the web-based DNA Sequence to Structure conversion tool [46] at the Supercomputing Facility for Bioinformatics and Computational Biology, IIT, Delhi. Binding of FOXP2 DBD 3D models to CAAATT was predicted with HDock (PMID: 28521030). DNA docked structures of FOXP2 were further analyzed using Discovery Studio (v20.1.0.19295) tool (Discovery Studio Visualizer v20.1.0.19295, San Diego, USA) to identify the interacting residues of FOXP2 with its consensus DNA motif.

Retrieval of nsSNPs dataset
The dbSNP database of NCBI provides a total of 144643 SNPs data for the FOXP2 protein.
Out of 144643 SNPs, 141823 were reported to be present in intron region (98.05%), 393 were missense SNPs (0.27%), 240 were synonymous (0.17%) and 2353 were non-coding transcript variant (1.63%). Since the non-synonymous SNPs often alter the encoded amino acid, the present study only considered these SNPs for further analysis. The detail information about these SNPs was given in S1 Table. The proportions of each of the SNPs category are represented in graphical form (Fig 1).

Identification of high-risk SNPs
Five computational algorithms like PolyPhen2, PROVEAN, SNPnexus, SNAP-2, PON-P2 were applied in the present study to identify detrimental nsSNPs in human FOXP2 protein.
The identification outcomes with these tools are presented in Fig 2. Among the tools, Poly-Phen-2 anticipated 257 missense SNPs as probably or possibly damaging while PROVEAN determined 136 nsSNPs have deleterious effect on the proteins function. From the SIFT data of SNPnexus, 123 nsSNPs were found to be damaging. SNAP2 identified nonneutral 190 nsSNPs that might cause change in protein's activity while PON-P2 estimated 109 nsSNPs as pathogenic. The list of these SNPs is provided in S2 Table. Out of the 393 missense mutations, potent 43 nsSNPs that are found to be highly risky in all of the prediction tools were prioritized for further analysis and listed in Table 1.

Identification of disease associated deleterious SNPs
The deleterious polymorphisms were then subjected to SNPs&GO and PhD-SNP servers to predict the variants that are potentially associated with diseases or not. Out of the 43 high confidence nsSNPs, SNPs&GO and PhD-SNP server designated 11 and 14 nsSNPs to have potential role in diseases. Among these, 5 nsSNPs were common in both tools and selected for further investigations (Table 2).

Functional and structural modification prediction
The selected disease related nsSNPs were analyzed by MutPred 2 server to understand their molecular mechanism of pathogenicity ( Table 3). The predicted modifications of structural and functional properties include alteration of disordered interface, transmembrane protein,

Effect on protein stability
Both I-Mutant 2.0 and MUpro servers were used to interpret whether the protein will be in stable or denatured form due to point mutation. The result expressed either in the free energy change value (ΔΔG) or in the sign of DDG where positive and negative DDG value indicates increased or decreased stability, respectively. All of the variants represented in Table 4 have negative DDG value indicating that they reduce the stability of the variants.

Prediction of post translational modifications
Out of 5 nsSNPs, the Modpred server reported only 2 nsSNPs (Y531H and R536G) to be present at PTM sites and 1 variant (R553C) at amidation sites. Table 4 depicts the post translational modification (PTM) of altered amino acids.

Sequence conservation study of nsSNPs
The variants Y531H, L558P, R536G and L291P were reported by consurf to be highly conserved as structural and buried and R553C was predicted to be conserved being functional and exposed. The server estimated the confidence level by giving a conservation score (1-9) for the conservation of sequence. The functional and structural consequences of 5 high-risk nsSNPs with their phylogenetic conservation scores are given in S1 Fig and Table 5.

Structural validation of wild and mutant models
The four mutants Y531H, L558P, R553C, R536G are present in the winged-helix DNA binding domain (DBD) of the FOXP2 protein. So, the native DBD structure of the protein and its four detrimental, disease associated variants were using RaptorX. Model quality validation output file of Ramachandran plot and ProSA Z-score were described in Table 6. The Ramachandran plot depicted that 96.1% residues were in most favored region whereas 0% in disallowed region among the 82 (503-584) residues in native DBD of the FOXP2 protein. Over 94% residues also occupy most favorable region in the predicted DBD structure of the 4 variants. ProSA Z-score and Ramachandran plot indicate the overall good quality of the 3D modelling. Docking analysis. The docking analysis of FOXP2 DBD with its most common target sequence reported that the mutant variants bind to the DNA sequence in a slightly diverged orientation than wild type variants of DBD. Upon binding, the native protein shows interaction with Arg 504, Tyr 509, His 554, Leu 558, Pro 505, Trp 548 and Ala 551residues. Positively  Table. The winged helix DNA binding domain of FOXP2 is crucial for analysis as majority of the disease-causing mutations are found in that region [47]. The mutation of arginine to histidine at 553th position in human FOXP2 has been associated with severe congenital speech impairement disorder which plays a vital role in the FOXP2-DNA interaction [48]. Most of the interactions between deoxyribonucleic acid and transcription factors are between the negatively charged phosphodiester bond of the DNA and positively charged amino acid residues of the protein [49]. Banerjee-Basu and Baxervanis in 2004 observed that the R553H mutation causes a net reduction of the electrostatic potential in the DNA binding surface of the protein. Due to the disturbance to the electrostatics of the protein the interactions between the backbone of the DNA and protein might be hampered. Several studies also suggest that electrostatic interactions also dominate the protein-DNA binding specificity [50]. In case of R553C mutation, a new electrostatic bond is formed between Phe 541 and Thy 10 0 which may change the electrostatic potential that likely led to imbalance in the organization of protein -DNA interface.
From various mutational investigations it was known that His554 plays a significant role in hydrogen bond formation between FOXP2 FHD and DNA [51]. This interaction was also conserved in other FOX DBDs like FOXO1, FOXK1 and FoxA1 [52]. The substitution of leucine to proline at 558 th position missed the interaction with His554, which may lead to the reduction of specificity of the protein with its cognate sequence. Since hydrogen bond stabilizes the protein-DNA interaction [53], the missing of most conserved H-bond might destabilize the interaction. The more interactions between the DNA and protein may predict the higher affinity of binding. Upon binding, the native protein shows interaction with Arg 504, Tyr 509, His 554, Leu 558, Pro 505, Trp 548, Ala 551 residues. L558P mutation reduces the interactions with other residues compared to native protein which may lead to lower affinity of protein-DNA binding.

Conclusion
With the combination of various in silico tools, the study determined the deleterious nsSNPs in FOXP2 protein. The outcomes predicted R553C and L558P mutants as the most damaging Table 6. The results of three-dimensional model validation.

PDBSum
Most favoured region (%) Additional allowed region (%) Generously allowed region (%) Disallowed region (%) ProSA web (Z-score) nsSNPs that may alter protein-DNA interface interactions, reduce the specificity and affinity of the binding. This kind of structural and functional alterations is likely to interrupt regulation of protein and initiate formation of disease. In-depth experimental verification is needed to determine the role of mutant variants more precisely.